This script follows the steps necessary to reproduce the main results in Camblor et al., 2022. Mentions to Tables and Figures refer to those of this work.
library(readxl)
library(tidyverse)
library(HardyWeinberg)
library(epitools)
library(ggsignif)
library(plotly)
Our study focused on COVID-19 patients admitted to ICU. The data regarding their characterization, clinical variables and genotypes were collected in an Excel file, which can be loaded in R with:
all <- read_excel("patients_data.xlsx")
For this analysis, we want to extract the main clinical and genetic covariables that are in our interest:
data <- all %>%
select(EXITUS, AGE, SEX, HTN, DM, HLP, IL6_gt70, DD_gt2000, NFKB1, NFKBIA, NFKBIZ) %>%
mutate_if(is.character, as.factor)
With each of the variables standing for:
As part of this study, 300 controls were also genotyped for their comparison with the ICU COVID-19 patients. These data were also stored in an Excel file:
control_data <- read_excel("controls_data.xlsx")
Before the analysis, we can declare some variables to act as global definitions:
STUDY_GROUPS <- c("Survivors", "Deceased", "Patients", "Controls")
# Survivors and Deceased refer to ICU patients (EXITUS = NO/YES)
# Patients are the total of the ICU individuals
CLIN_VAR <- colnames(data)[!colnames(data) %in% c("EXITUS", "NFKB1", "NFKBIA", "NFKBIZ")]
CAT_CLIN_VAR <- CLIN_VAR[-which(CLIN_VAR=="AGE")] # Categorical clinical variables
GENES <- c("NFKB1", "NFKBIA", "NFKBIZ")
SIG_LVL <- 0.05
We can inspect our data at a glance with:
summary(data)
## EXITUS AGE SEX HTN DM HLP
## NO :371 Min. :11.00 FEMALE:129 NO :214 NO :352 NO :240
## YES: 99 1st Qu.:57.00 MALE :341 YES:256 YES :104 YES :216
## Median :65.00 NA's: 14 NA's: 14
## Mean :63.41
## 3rd Qu.:73.00
## Max. :84.00
## IL6_gt70 DD_gt2000 NFKB1 NFKBIA NFKBIZ
## NO :253 NO :309 DD: 78 AA: 81 DD: 22
## YES :164 YES :101 ID:213 GA:215 ID:158
## NA's: 53 NA's: 60 II:179 GG:174 II:290
##
##
##
The total number of COVID-19 ICU patients we are studying is 470. We can highlight a median age of 65 and the predominance of males in the ICU (both age and male sex are risk factors for severe COVID-19). Indeed, the distribution of ages is highly left-skewed:
ggplot(data, aes(x=AGE)) +
geom_histogram(aes(y = ..density..), color="black", fill="white") +
geom_density(color="red", alpha=0.1, fill="red") +
labs(x="Age", y="Density") +
theme_bw()
Indicating, as expected, that COVID-19 ICU patients tend to be of higher ages.
NA values arise for some of the clinical variables (DM, HLP, IL6_gt70, DD_gt2000), as they were not measured in all the patients. We put our focus on explaining the variable EXITUS —representing the COVID-19 severity— as a function of the rest of the variables.
In Table 1 we show the frequencies in each of the study groups (EXITUS = YES/NO), which can be obtained with the following code. The frequencies can be computed with:
clin_freqs <- list()
for (var in CAT_CLIN_VAR) {
clin_freqs[["absolute"]][[var]] <- table(data[[var]], data$EXITUS)
clin_freqs[["relative"]][[var]] <- apply(clin_freqs[["absolute"]][[var]], 2, function(x){x/sum(x)})
}
Accessing individual values in the console with:
clin_freqs$relative$SEX
##
## NO YES
## FEMALE 0.2722372 0.2828283
## MALE 0.7277628 0.7171717
Which leaves us with the following summaries, included in Table 1:
risk_groups <- lapply(clin_freqs$absolute, function(x){x[2,]}) # Extract the group of risk
abs_freq_summary <- do.call(rbind.data.frame, risk_groups) # Join data in a df
colnames(abs_freq_summary) <- STUDY_GROUPS[1:2]
rownames(abs_freq_summary) <- c("SEX (MALE)", CAT_CLIN_VAR[-which(CAT_CLIN_VAR=="SEX")])
abs_freq_summary
| Survivors | Deceased | |
|---|---|---|
| SEX (MALE) | 270 | 71 |
| HTN | 192 | 64 |
| DM | 83 | 21 |
| HLP | 164 | 52 |
| IL6_gt70 | 122 | 42 |
| DD_gt2000 | 77 | 24 |
risk_groups <- lapply(clin_freqs$relative, function(x){x[2,]}) # Extract the group of risk
rel_freq_summary <- do.call(rbind.data.frame, risk_groups) # Join data in a df
colnames(rel_freq_summary) <- STUDY_GROUPS[1:2]
rownames(rel_freq_summary) <- c("SEX (MALE)", CAT_CLIN_VAR[-which(CAT_CLIN_VAR=="SEX")])
rel_freq_summary
| Survivors | Deceased | |
|---|---|---|
| SEX (MALE) | 0.7277628 | 0.7171717 |
| HTN | 0.5175202 | 0.6464646 |
| DM | 0.2286501 | 0.2258065 |
| HLP | 0.4517906 | 0.5591398 |
| IL6_gt70 | 0.3588235 | 0.5454545 |
| DD_gt2000 | 0.2436709 | 0.2553191 |
Finally, the mean ages ± standard deviation in the study groups (also included in Table 1) are:
data %>%
group_by(EXITUS) %>%
summarize(Mean=mean(AGE), SD=sd(AGE))
| EXITUS | Mean | SD |
|---|---|---|
| NO | 61.58760 | 12.42468 |
| YES | 70.26263 | 10.23159 |
In order to explore the relationship between the clinical variables and EXITUS in our data, we performed a univariate logistic regression for each of them as independent variables, placing EXITUS as the dependent variable.
Each of the logistic regressions can be gathered in a single list with:
exitus_clinical_lrs <- list()
for (var in CLIN_VAR) {
exitus_clinical_lrs[[var]] <- glm(data$EXITUS ~ data[[var]], family="binomial")
}
Accessing the elements in the console, specially for a summary, with:
summary(exitus_clinical_lrs[["SEX"]])
##
## Call:
## glm(formula = data$EXITUS ~ data[[var]], family = "binomial")
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6996 -0.6833 -0.6833 -0.6833 1.7715
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.28292 0.21358 -6.007 1.89e-09 ***
## data[[var]]MALE -0.05283 0.25180 -0.210 0.834
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 483.91 on 469 degrees of freedom
## Residual deviance: 483.87 on 468 degrees of freedom
## AIC: 487.87
##
## Number of Fisher Scoring iterations: 4
A summary table of the logistic regression results, which was incorporated in Table 1 of the study, was made using the following code:
glm_table <- data.frame("p-value"=numeric(0), "OR"=numeric(0), "Upper_95CI"=numeric(0), "Lower_95CI"=numeric(0))
for (var in CLIN_VAR) {
fit <- glm(data$EXITUS ~ data[[var]], family="binomial")
p <- summary(fit)$coefficients[2,4]
or <- exp(coef(fit))[2]
ub <- exp(confint(fit))[2,1]
lb <- exp(confint(fit))[2,2]
glm_table[var,] <- c(p, or, ub, lb)
}
glm_table
| p.value | OR | Upper_95CI | Lower_95CI | |
|---|---|---|---|---|
| AGE | 0.0000000 | 1.0819502 | 1.0552660 | 1.111718 |
| SEX | 0.8338289 | 0.9485450 | 0.5840536 | 1.572021 |
| HTN | 0.0229231 | 1.7047619 | 1.0825549 | 2.720629 |
| DM | 0.9534990 | 0.9839357 | 0.5603686 | 1.672260 |
| HLP | 0.0653935 | 1.5389649 | 0.9745987 | 2.443747 |
| IL6_gt70 | 0.0028068 | 2.1442623 | 1.3021251 | 3.551732 |
| DD_gt2000 | 0.8180296 | 1.0641929 | 0.6175902 | 1.789455 |
In the case of the age, a logistic regression answers a different question than what we are interested in: to test the differences in means. The Shapiro-Wilk test is convincing in the lack of normality in the AGE data, but a t-test can still be performed as per the Central Limit Theorem, which considers the approximation of sample means to normality in large samples (in this case, we have groups of N=371 and N=99).
t.test(data$AGE ~ data$EXITUS) # Welch approximation (variances unequal)
##
## Welch Two Sample t-test
##
## data: data$AGE by data$EXITUS
## t = -7.1465, df = 182.8, p-value = 2.054e-11
## alternative hypothesis: true difference in means between group NO and group YES is not equal to 0
## 95 percent confidence interval:
## -11.07006 -6.27999
## sample estimates:
## mean in group NO mean in group YES
## 61.58760 70.26263
A very low p-value (<< 0.05) makes certain a difference in the mean ages: deceased COVID-19 patients at the ICU are of an older age.
In order to extract the genotype and allele frequencies —both
absolute and relative— from the study data, the
variant_freqs function was written.
#' Variant frequencies
#'
#' Compute the genetic (genotype and allele) frequencies for variant genotype observations stored
#' in a vector
#'
#' @param variant A chr vector of genotypes
#' @param group A grouping vector for the genotypes (optional)
#' @param margin How the relative frequencies will be computed with grouping variables:
#' 2, genotype proportions for each group; 1, distribution among groups for each genotype
#'
#' @return A list with both absolute and relative frequencies for genotypes and alleles
variant_freqs <- function(variant, group=NULL, margin=2) {
frequencies <- list()
if (is.null(group)) {margin=2} # Force margin 2 if only the genotypes are provided
# Compute the frequencies for the genotypes. To simplify the code, it will be working with
# matrices: if the input of the user is only a variant, create a temporary column = c(1, 1, 1)
if (is.null(group)) {frequencies$genotype_abs <- cbind(table(variant), c(1, 1, 1))}
else {frequencies$genotype_abs <- table(variant, group)}
frequencies$genotype_rel <- apply(frequencies$genotype_abs, margin, function(x){x/sum(x)})
# Compute the frequencies for the alleles: 2 * homozygote count + heterozygote count
first_allele_freq <- frequencies$genotype_abs[3,] * 2 + frequencies$genotype_abs[2,]
second_allele_freq <- frequencies$genotype_abs[1,] * 2 + frequencies$genotype_abs[2,]
frequencies$allele_abs <- rbind(first_allele_freq, second_allele_freq)
rownames(frequencies$allele_abs) <- c(substr(rownames(frequencies$genotype_abs)[3], 1, 1),
substr(rownames(frequencies$genotype_abs)[1], 1, 1))
frequencies$allele_rel <- apply(frequencies$allele_abs, margin, function(x){x/sum(x)})
# If no group was provided, remove the temporary column
if (is.null(group)) {frequencies <- lapply(frequencies, function(x){x[,1]})}
return(frequencies)
}
Being able to compute the frequencies among the ICU patients with:
NFKB1 <- variant_freqs(data$NFKB1, data$EXITUS)
NFKBIA <- variant_freqs(data$NFKBIA, data$EXITUS)
NFKBIZ <- variant_freqs(data$NFKBIZ, data$EXITUS)
The total frequencies for the ICU patients are computed with:
NFKB1_ICU <- variant_freqs(data$NFKB1)
NFKBIA_ICU <- variant_freqs(data$NFKBIA)
NFKBIZ_ICU <- variant_freqs(data$NFKBIZ)
And finally, the frequencies for the controls are:
NFKB1_controls <- variant_freqs(control_data$NFKB1)
NFKBIA_controls <- variant_freqs(control_data$NFKBIA)
NFKBIZ_controls <- variant_freqs(control_data$NFKBIZ)
This leaves us with the following frequency summaries, which were used for Table 3:
The absolute frequencies for rs28362491 are:
NFKB1_abs <- as.data.frame(
rbind(cbind(t(NFKB1$genotype_abs), t(NFKB1$allele_abs)),
c(NFKB1_ICU$genotype_abs, NFKB1_ICU$allele_abs),
c(NFKB1_controls$genotype_abs, NFKB1_controls$allele_abs))
)
rownames(NFKB1_abs) <- STUDY_GROUPS; NFKB1_abs
| DD | ID | II | I | D | |
|---|---|---|---|---|---|
| Survivors | 62 | 166 | 143 | 452 | 290 |
| Deceased | 16 | 47 | 36 | 119 | 79 |
| Patients | 78 | 213 | 179 | 571 | 369 |
| Controls | 46 | 142 | 112 | 366 | 234 |
The relative frequencies are:
NFKB1_rel <- as.data.frame(
rbind(cbind(t(NFKB1$genotype_rel), t(NFKB1$allele_rel)),
c(NFKB1_ICU$genotype_rel, NFKB1_ICU$allele_rel),
c(NFKB1_controls$genotype_rel, NFKB1_controls$allele_rel))
)
rownames(NFKB1_rel) <- STUDY_GROUPS; round(NFKB1_rel, 2)
| DD | ID | II | I | D | |
|---|---|---|---|---|---|
| Survivors | 0.17 | 0.45 | 0.39 | 0.61 | 0.39 |
| Deceased | 0.16 | 0.47 | 0.36 | 0.60 | 0.40 |
| Patients | 0.17 | 0.45 | 0.38 | 0.61 | 0.39 |
| Controls | 0.15 | 0.47 | 0.37 | 0.61 | 0.39 |
The absolute frequencies for rs696 are:
NFKBIA_abs <- as.data.frame(
rbind(cbind(t(NFKBIA$genotype_abs), t(NFKBIA$allele_abs)),
c(NFKBIA_ICU$genotype_abs, NFKBIA_ICU$allele_abs),
c(NFKBIA_controls$genotype_abs, NFKBIA_controls$allele_abs))
)
rownames(NFKBIA_abs) <- STUDY_GROUPS; NFKBIA_abs
| AA | GA | GG | G | A | |
|---|---|---|---|---|---|
| Survivors | 64 | 175 | 132 | 439 | 303 |
| Deceased | 17 | 40 | 42 | 124 | 74 |
| Patients | 81 | 215 | 174 | 563 | 377 |
| Controls | 61 | 149 | 90 | 329 | 271 |
The relative frequencies are:
NFKBIA_rel <- as.data.frame(
rbind(cbind(t(NFKBIA$genotype_rel), t(NFKBIA$allele_rel)),
c(NFKBIA_ICU$genotype_rel, NFKBIA_ICU$allele_rel),
c(NFKBIA_controls$genotype_rel, NFKBIA_controls$allele_rel))
)
rownames(NFKBIA_rel) <- STUDY_GROUPS; round(NFKBIA_rel, 2)
| AA | GA | GG | G | A | |
|---|---|---|---|---|---|
| Survivors | 0.17 | 0.47 | 0.36 | 0.59 | 0.41 |
| Deceased | 0.17 | 0.40 | 0.42 | 0.63 | 0.37 |
| Patients | 0.17 | 0.46 | 0.37 | 0.60 | 0.40 |
| Controls | 0.20 | 0.50 | 0.30 | 0.55 | 0.45 |
The absolute frequencies for rs3217713 are:
NFKBIZ_abs <- as.data.frame(
rbind(cbind(t(NFKBIZ$genotype_abs), t(NFKBIZ$allele_abs)),
c(NFKBIZ_ICU$genotype_abs, NFKBIZ_ICU$allele_abs),
c(NFKBIZ_controls$genotype_abs, NFKBIZ_controls$allele_abs))
)
rownames(NFKBIZ_abs) <- STUDY_GROUPS; NFKBIZ_abs
| DD | ID | II | I | D | |
|---|---|---|---|---|---|
| Survivors | 19 | 133 | 219 | 571 | 171 |
| Deceased | 3 | 25 | 71 | 167 | 31 |
| Patients | 22 | 158 | 290 | 738 | 202 |
| Controls | 12 | 87 | 189 | 465 | 111 |
The relative frequencies are:
NFKBIZ_rel <- as.data.frame(
rbind(cbind(t(NFKBIZ$genotype_rel), t(NFKBIZ$allele_rel)),
c(NFKBIZ_ICU$genotype_rel, NFKBIZ_ICU$allele_rel),
c(NFKBIZ_controls$genotype_rel, NFKBIZ_controls$allele_rel))
)
rownames(NFKBIZ_rel) <- STUDY_GROUPS; round(NFKBIZ_rel, 2)
| DD | ID | II | I | D | |
|---|---|---|---|---|---|
| Survivors | 0.05 | 0.36 | 0.59 | 0.77 | 0.23 |
| Deceased | 0.03 | 0.25 | 0.72 | 0.84 | 0.16 |
| Patients | 0.05 | 0.34 | 0.62 | 0.79 | 0.21 |
| Controls | 0.04 | 0.30 | 0.66 | 0.81 | 0.19 |
Hardy-Weinberg Equilibrium (HWE) testing is commonly used in genetic association studies as a quality control, as departure from its criteria indicates an underlying unwanted variation that may lead to problems in the interpretation of the results.
The testing is performed for each set of genotype frequencies by means of a Chi-squared test against the expected frequencies that would result of meeting the HWE criteria. We can obtain a simple summary of p-values with:
# The get() function is used to translate the strings in GENES to variable names
for (gene in GENES) {
print(paste("-----", gene, "----"))
cat("Survivors:\n ")
print(
HWChisq(get(gene)$genotype_abs[,"NO"], verbose=FALSE)$pval
)
cat("Deceased:\n ")
print(
HWChisq(get(gene)$genotype_abs[,"YES"], verbose=FALSE)$pval
)
cat("ICU:\n ")
print(
HWChisq(get(paste0(gene, "_ICU"))$genotype_abs, verbose=FALSE)$pval
)
cat("Controls:\n ")
print(
HWChisq(get(paste0(gene, "_controls"))$genotype_abs, verbose=FALSE)$pval
)
cat("\n")
}
## [1] "----- NFKB1 ----"
## Survivors:
## [1] 0.2804542
## Deceased:
## [1] 0.9372743
## ICU:
## [1] 0.3149684
## Controls:
## [1] 0.9757332
##
## [1] "----- NFKBIA ----"
## Survivors:
## [1] 0.7053844
## Deceased:
## [1] 0.2306533
## ICU:
## [1] 0.3349555
## Controls:
## [1] 0.9599656
##
## [1] "----- NFKBIZ ----"
## Survivors:
## [1] 0.9262173
## Deceased:
## [1] 0.8924809
## ICU:
## [1] 0.9632842
## Controls:
## [1] 0.7360303
With all the genotypes meeting the HWE criteria (p-value > 0.05), we can now proceed to the proper genetic association analysis.
In this section we’ll perform the association (Chi-squared) tests between the genotype frequencies and the EXITUS variable, outputting some of the results placed in Table 3.
We will begin analyzing the deceased vs. survivors data (again, COVID-19 patients in the ICU). A preliminary Chi-squared test can be performed for the genotypes in each of the three polymorphisms with:
chisq_NFKB1 <- chisq.test(NFKB1$genotype_abs); chisq_NFKB1
##
## Pearson's Chi-squared test
##
## data: NFKB1$genotype_abs
## X-squared = 0.24042, df = 2, p-value = 0.8867
chisq_NFKBIA <- chisq.test(NFKBIA$genotype_abs); chisq_NFKBIA
##
## Pearson's Chi-squared test
##
## data: NFKBIA$genotype_abs
## X-squared = 1.7712, df = 2, p-value = 0.4125
chisq_NFKBIZ <- chisq.test(NFKBIZ$genotype_abs); chisq_NFKBIZ
##
## Pearson's Chi-squared test
##
## data: NFKBIZ$genotype_abs
## X-squared = 5.3789, df = 2, p-value = 0.06792
Observing a trend towards significance in the NFKBIZ rs3217713 genotypes. Exploring the deviation in the observed values from the expected values in each case, we can group genotypes based on similar trends to further improve the statistical power in the tests to follow. This way:
chisq_NFKB1$observed-chisq_NFKB1$expected
## group
## variant NO YES
## DD 0.4297872 -0.4297872
## ID -2.1340426 2.1340426
## II 1.7042553 -1.7042553
For the NFKB1 rs28362491 genotypes, heterozygotes are more frequent in deceased patients vs. survivors: ID / II+DD. However, the interpretation of a heterozygote effect is harder to elucidate, so we chose to group ID heterozygotes to the homozygote group with the nearest value of change; i.e: ID+DD / II.
chisq_NFKBIA$observed-chisq_NFKBIA$expected
## group
## variant NO YES
## AA 0.06170213 -0.06170213
## GA 5.28723404 -5.28723404
## GG -5.34893617 5.34893617
For the NFKBIA rs696 genotypes, GG homozygosis is more frequent in deceased patients vs. survivors: GG / GA+AA.
chisq_NFKBIZ$observed-chisq_NFKBIZ$expected
## group
## variant NO YES
## DD 1.634043 -1.634043
## ID 8.280851 -8.280851
## II -9.914894 9.914894
Finally, for the NFKBIZ rs3217713 genotypes, the II homozygosis is more frequent in deceased patients vs. survivors: II / ID+DD.
As we will be working with them later, these groups can be reflected in the data with:
data <- data %>%
mutate(NFKB1_GROUP = ifelse(NFKB1=="II", "II", "ID+DD")) %>%
mutate(NFKBIA_GROUP = ifelse(NFKBIA=="GG", "GG", "GA+AA")) %>%
mutate(NFKBIZ_GROUP = ifelse(NFKBIZ=="II", "II", "ID+DD"))
Finally, we will be executing the new Chi-squared tests. The default Yates’ correction for continuity will not be considered, as no expected values < 5 were found (and as it is regarded by some authors as too conservative). The final genotype association tests were:
chisq.test(data$EXITUS, data$NFKB1_GROUP, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: data$EXITUS and data$NFKB1_GROUP
## X-squared = 0.15762, df = 1, p-value = 0.6914
chisq.test(data$EXITUS, data$NFKBIA_GROUP, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: data$EXITUS and data$NFKBIA_GROUP
## X-squared = 1.5703, df = 1, p-value = 0.2102
chisq.test(data$EXITUS, data$NFKBIZ_GROUP, correct=FALSE) # Significant
##
## Pearson's Chi-squared test
##
## data: data$EXITUS and data$NFKBIZ_GROUP
## X-squared = 5.3234, df = 1, p-value = 0.02104
Where we then found a significant association of the NFKBIZ rs3217713 II genotype to mortality (frequency raised in deceased patients, as previously stated). The odds ratios can be computed with:
oddsratio.wald(t(table(data$EXITUS, data$NFKB1_GROUP)), rev="r")$measure
## odds ratio with 95% C.I.
## estimate lower upper
## II 1.000000 NA NA
## ID+DD 1.097588 0.6930439 1.738272
oddsratio.wald(t(table(data$EXITUS, data$NFKBIA_GROUP)))$measure
## odds ratio with 95% C.I.
## estimate lower upper
## GA+AA 1.000000 NA NA
## GG 1.334131 0.8492349 2.095892
oddsratio.wald(t(table(data$EXITUS, data$NFKBIZ_GROUP)))$measure
## odds ratio with 95% C.I.
## estimate lower upper
## ID+DD 1.000000 NA NA
## II 1.759948 1.084839 2.855186
We can now perform the analyses to compare the genotype frequencies in the total of the ICU patients (deceased + survivors) against our controls.
First, we establish the comparison matrices with the absolute frequencies previously gathered:
NFKB1_genotype_matrix <- cbind(NFKB1_ICU$genotype_abs, NFKB1_controls$genotype_abs)
colnames(NFKB1_genotype_matrix) <- STUDY_GROUPS[3:4]
NFKBIA_genotype_matrix <- cbind(NFKBIA_ICU$genotype_abs, NFKBIA_controls$genotype_abs)
colnames(NFKBIA_genotype_matrix) <- STUDY_GROUPS[3:4]
NFKBIZ_genotype_matrix <- cbind(NFKBIZ_ICU$genotype_abs, NFKBIZ_controls$genotype_abs)
colnames(NFKBIZ_genotype_matrix) <- STUDY_GROUPS[3:4]
We perform an initial set of Chi-squared tests:
chisq_NFKB1_pvc <- chisq.test(NFKB1_genotype_matrix); chisq_NFKB1_pvc
##
## Pearson's Chi-squared test
##
## data: NFKB1_genotype_matrix
## X-squared = 0.36974, df = 2, p-value = 0.8312
chisq_NFKBIA_pvc <- chisq.test(NFKBIA_genotype_matrix); chisq_NFKBIA_pvc
##
## Pearson's Chi-squared test
##
## data: NFKBIA_genotype_matrix
## X-squared = 4.1826, df = 2, p-value = 0.1235
chisq_NFKBIZ_pvc <- chisq.test(NFKBIZ_genotype_matrix); chisq_NFKBIZ_pvc
##
## Pearson's Chi-squared test
##
## data: NFKBIZ_genotype_matrix
## X-squared = 1.1821, df = 2, p-value = 0.5538
We assess which genotypes can be grouped based on equivalent changes among groups:
chisq_NFKB1_pvc$observed-chisq_NFKB1_pvc$expected # DD / II+ID (to avoid testing heterozygosis vs. homozygosis)
## Patients Controls
## DD 2.311688 -2.311688
## ID -3.688312 3.688312
## II 1.376623 -1.376623
chisq_NFKBIA_pvc$observed-chisq_NFKBIA_pvc$expected # GG / GA+AA
## Patients Controls
## AA -5.675325 5.675325
## GA -7.181818 7.181818
## GG 12.857143 -12.857143
chisq_NFKBIZ_pvc$observed-chisq_NFKBIZ_pvc$expected # ID+DD / II
## Patients Controls
## DD 0.9182058 -0.9182058
## ID 6.0870712 -6.0870712
## II -7.0052770 7.0052770
We establish the groups:
grouped_NFKB1_genotype_matrix <- rbind(NFKB1_genotype_matrix["DD",],
NFKB1_genotype_matrix["ID",] + NFKB1_genotype_matrix["II",])
grouped_NFKBIA_genotype_matrix <- rbind(NFKBIA_genotype_matrix["GG",],
NFKBIA_genotype_matrix["GA",] + NFKBIA_genotype_matrix["AA",])
grouped_NFKBZ_genotype_matrix <- rbind(NFKBIZ_genotype_matrix["ID",] + NFKBIZ_genotype_matrix["DD",],
NFKBIZ_genotype_matrix["II",])
The final genotype association tests were:
chisq.test(grouped_NFKB1_genotype_matrix, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: grouped_NFKB1_genotype_matrix
## X-squared = 0.216, df = 1, p-value = 0.6421
chisq.test(grouped_NFKBIA_genotype_matrix, correct=FALSE) # Significant
##
## Pearson's Chi-squared test
##
## data: grouped_NFKBIA_genotype_matrix
## X-squared = 4.0067, df = 1, p-value = 0.04532
chisq.test(grouped_NFKBZ_genotype_matrix, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: grouped_NFKBZ_genotype_matrix
## X-squared = 1.1815, df = 1, p-value = 0.2771
Where we find that, although not conclusively, the p-value for NFKBIA < 0.05 states a significant deviation in the genotype frequencies of the rs696 polymorphism in ICU patients vs. controls: GG is higher in the patients. Finally, we compute the odds ratios:
oddsratio.wald(grouped_NFKB1_genotype_matrix)$measure
## NA
## odds ratio with 95% C.I. estimate lower upper
## [1,] 1.000000 NA NA
## [2,] 1.098713 0.7386336 1.63433
oddsratio.wald(grouped_NFKBIA_genotype_matrix)$measure
## NA
## odds ratio with 95% C.I. estimate lower upper
## [1,] 1.000000 NA NA
## [2,] 1.371622 1.006124 1.869895
oddsratio.wald(grouped_NFKBZ_genotype_matrix)$measure
## NA
## odds ratio with 95% C.I. estimate lower upper
## [1,] 1.000000 NA NA
## [2,] 1.184953 0.87247 1.609355
Here, we will perform the equivalent calculations for the allele frequencies, completing the results of Table 3.
We will be working with the two groups of ICU patients. The Chi-squared tests are directly performed with:
chisq.test(NFKB1$allele_abs, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: NFKB1$allele_abs
## X-squared = 0.043582, df = 1, p-value = 0.8346
chisq.test(NFKBIA$allele_abs, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: NFKBIA$allele_abs
## X-squared = 0.77976, df = 1, p-value = 0.3772
chisq.test(NFKBIZ$allele_abs, correct=FALSE) # Significant
##
## Pearson's Chi-squared test
##
## data: NFKBIZ$allele_abs
## X-squared = 5.0581, df = 1, p-value = 0.02451
And then, the odds ratios are computed with:
oddsratio.wald(NFKB1$allele_abs)$measure
## NA
## odds ratio with 95% C.I. estimate lower upper
## I 1.000000 NA NA
## D 1.034715 0.7510609 1.425496
oddsratio.wald(NFKBIA$allele_abs, rev="r")$measure
## NA
## odds ratio with 95% C.I. estimate lower upper
## A 1.00000 NA NA
## G 1.15656 0.8373311 1.597493
oddsratio.wald(NFKBIZ$allele_abs, rev="r")$measure
## NA
## odds ratio with 95% C.I. estimate lower upper
## D 1.000000 NA NA
## I 1.613299 1.060375 2.45454
Now, we are comparing the total of ICU patients against the controls. We establish the matrices, joining the patients and controls data:
NFKB1_allele_matrix <- rbind(NFKB1_ICU$allele_abs, NFKB1_controls$allele_abs)
NFKBIA_allele_matrix <- rbind(NFKBIA_ICU$allele_abs, NFKBIA_controls$allele_abs)
NFKBIZ_allele_matrix <- rbind(NFKBIZ_ICU$allele_abs, NFKBIZ_controls$allele_abs)
The Chi-squared tests are performed with:
chisq.test(NFKB1_allele_matrix, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: NFKB1_allele_matrix
## X-squared = 0.010021, df = 1, p-value = 0.9203
chisq.test(NFKBIA_allele_matrix, correct=FALSE) # Significant
##
## Pearson's Chi-squared test
##
## data: NFKBIA_allele_matrix
## X-squared = 3.8478, df = 1, p-value = 0.04981
chisq.test(NFKBIZ_allele_matrix, correct=FALSE)
##
## Pearson's Chi-squared test
##
## data: NFKBIZ_allele_matrix
## X-squared = 1.0729, df = 1, p-value = 0.3003
And then, the odds ratios are computed with:
oddsratio.wald(NFKB1_allele_matrix, rev="r")$measure
## NA
## odds ratio with 95% C.I. estimate lower upper
## [1,] 1.000000 NA NA
## [2,] 1.010777 0.8194171 1.246826
oddsratio.wald(NFKBIA_allele_matrix)$measure
## NA
## odds ratio with 95% C.I. estimate lower upper
## [1,] 1.0000 NA NA
## [2,] 1.2301 1.000059 1.513057
oddsratio.wald(NFKBIZ_allele_matrix, rev="r")$measure
## NA
## odds ratio with 95% C.I. estimate lower upper
## [1,] 1.000000 NA NA
## [2,] 1.146634 0.8849813 1.485648
To end the association study, we would be interested to test whether
the genotypes of the three variants are associated to the clinical
variables or not. The frequencies for each genotype were determined
using the variant_freqs function, specifying the clinical
variables as grouping variables instead of EXITUS; e.g.:
variant_freqs(data$NFKB1, data$SEX, 1)
## $genotype_abs
## group
## variant FEMALE MALE
## DD 20 58
## ID 64 149
## II 45 134
##
## $genotype_rel
## variant
## group DD ID II
## FEMALE 0.2564103 0.3004695 0.2513966
## MALE 0.7435897 0.6995305 0.7486034
##
## $allele_abs
## FEMALE MALE
## I 154 417
## D 104 265
##
## $allele_rel
## I D
## FEMALE 0.2697023 0.2818428
## MALE 0.7302977 0.7181572
# We use margin=1 to compare relative frequencies inside grouping variables
To explore the relationships between the variants and the clinical variables, we worked with the genotype groups previously established. A list of all the tests of interest was created with:
genotype_groups <- c("NFKB1_GROUP", "NFKBIA_GROUP", "NFKBIZ_GROUP")
genotypes_clin_associations <- list()
for (genotypes in genotype_groups) {
for (var in CAT_CLIN_VAR) {
genotypes_clin_associations[[genotypes]][[var]] <- chisq.test(data[[genotypes]], data[[var]], correct=FALSE)
}
}
Accessing individual tests in the console with:
genotypes_clin_associations$NFKB1_GROUP$SEX
##
## Pearson's Chi-squared test
##
## data: data[[genotypes]] and data[[var]]
## X-squared = 0.77279, df = 1, p-value = 0.3794
Differences in age among groups were tested with:
for (group in genotype_groups) {
print(t.test(data$AGE ~ data[[group]]))
}
These results were collected for NFKB1, NFKBIA and NFKBIZ separately in the Supplementary Tables 1A, 1B and 1C, respectively.
The main finding was a significant association of the NFKBIZ rs3217713 polymorphism to values of D-Dimer above 2000 ng/mL, as seen in:
genotypes_clin_associations$NFKBIZ_GROUP$DD_gt2000
##
## Pearson's Chi-squared test
##
## data: data[[genotypes]] and data[[var]]
## X-squared = 7.0801, df = 1, p-value = 0.007794
We can determine the nature of this association examining the residuals:
genotypes_clin_associations$NFKBIZ_GROUP$DD_gt2000$residuals
## data[[var]]
## data[[genotypes]] NO YES
## ID+DD 1.0496576 -1.8359726
## II -0.8014681 1.4018605
Finding that the II genotype is found more frequently than it could be expected in patients with D-Dimer above 2000 ng/mL.
The odds ratio was computed with:
fit <- glm(data$DD_gt2000 ~ data$NFKBIZ_GROUP, family="binomial")
exp(cbind(coef(fit), confint(fit)))
## 2.5 % 97.5 %
## (Intercept) 0.208000 0.133444 0.311754
## data$NFKBIZ_GROUPII 1.959657 1.200928 3.279022
The information we are going to represent is the proportion of patients with each NFKBIZ rs3217713 genotype that have D-Dimer levels above 2000 ng/mL, being the association of the II polymorphism to high D-Dimer levels a major finding in our study. The frequencies we are representing are:
freqs <- variant_freqs(data$NFKBIZ, data$DD_gt2000, 1)$genotype_rel
freqs <- t(freqs); freqs
## group
## variant NO YES
## DD 0.8571429 0.1428571
## ID 0.8248175 0.1751825
## II 0.7104247 0.2895753
Specifically, we are interested in those of the “YES” group (D-Dimer levels above 2000 ng/mL).
With them, we can now elaborate the plot that corresponds to Figure 1:
freqs <- as.data.frame(freqs) %>%
rownames_to_column("GENOTYPE") %>%
pivot_longer(cols=-GENOTYPE, names_to="DD_gt2000", values_to="FREQUENCY") %>%
mutate_if(is.character, as.factor)
figure_1 <- freqs %>%
filter(DD_gt2000=="YES") %>%
ggplot(aes(x=fct_rev(GENOTYPE), y=FREQUENCY, fill=GENOTYPE)) +
geom_bar(stat="identity", show.legend=FALSE, color="black", size=1.5) +
geom_text(aes(label=round(FREQUENCY, 2)), vjust=-0.80, size=7.5) +
ylim(0, 0.45) +
scale_fill_manual(values=c("#f8e5f8", "#f8e5f8", "#e699e0")) +
labs(x=element_blank(), y="D-dimer > 2000 ng/mL proportion") +
geom_signif(comparisons = list(c("II","DD")), y_position = 0.35,
tip_length = 0, vjust = -1, size=1, annotations = "II vs. ID+DD: p = 0.0078", textsize = 7) +
theme_classic() +
theme(
plot.title = element_text(hjust = 0.5, margin=margin(b=20), face="bold", size=20),
axis.title.y = element_text(face="bold", size=20, margin=margin(r=20)),
axis.text.x = element_text(face="bold", size=17),
axis.text.y = element_text(face="bold", size=17),
strip.text.x = element_text(face="bold")) +
theme(plot.margin = unit(c(0.3,0.2,0.2,0.2), "cm"))
figure_1
To explore a functional role of the NFKBIZ rs3217713 variant, we performed an expression study of the NFKBIZ transcript in the blood of some of the patients in our cohort (N = 52). Expression was determined in Ct values —the number of qPCR cycles needed to cross a threshold of significant fluorescent signal—, which are inversely proportional to the amount of transcript of the gene. The Ct value for each patient was measured in triplicate, computing the correspondent mean for each of them. The expression of the ACTB gene in Ct values was also determined; as it is a housekeeping gene with constant expression, it allows us to normalize the Ct values of the NFKBIZ gene to account for experimental variability. This can be done by dividing the Ct value of NFKBIZ by the Ct value of ACTB (computing a Ct ratio), for each patient. Again, this ratio is inversely proportional to the expression. These calculations were performed off the script and are our starting point for this last section.
We load the data (which are gathered in Supplementary Table 2) with:
NFKBIZ_expression <- read_excel("NFKBIZ_expression.xlsx")
We want to study the differences in expression as a function of the groups previously determined in the genetic association study:
NFKBIZ_expression <- NFKBIZ_expression %>%
mutate(GENOTYPE = ifelse(GENOTYPE=="II", "II", "ID+DD")) %>%
mutate_if(is.character, as.factor)
We can visually represent the data with a box plot:
p <- NFKBIZ_expression %>%
ggplot(aes(x=fct_rev(GENOTYPE), y=CT_RATIO, fill=GENOTYPE)) +
geom_boxplot() +
theme_bw() +
ylab("NFKBIZ/ACTB Ct ratio") +
scale_fill_manual(values=c("#f8e5f8", "#e699e0")) +
theme(
legend.position="none",
axis.title.y = element_text(face="bold", size=16, margin=margin(r=20)),
axis.title.x = element_blank(),
axis.text.x = element_text(face="bold", size=14),
axis.text.y = element_text(face="bold", size=14),
strip.text.x = element_text(face="bold"))
ggplotly(p) %>% config(displayModeBar = FALSE)
A necessary step before performing the t-test is checking for the normality of the expression data (independence, random sampling conditions are met), which was done executing a Shapiro-Wilk test:
shapiro.test(NFKBIZ_expression$CT_RATIO)
##
## Shapiro-Wilk normality test
##
## data: NFKBIZ_expression$CT_RATIO
## W = 0.97354, p-value = 0.2969
Accepting the normal distribution (null hypothesis; p > 0.05). The homogeneity of variances was tested with an F-test:
var.test(NFKBIZ_expression$CT_RATIO ~ NFKBIZ_expression$GENOTYPE)
##
## F test to compare two variances
##
## data: NFKBIZ_expression$CT_RATIO by NFKBIZ_expression$GENOTYPE
## F = 2.1737, num df = 20, denom df = 30, p-value = 0.0529
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.9902134 5.1051034
## sample estimates:
## ratio of variances
## 2.173677
Which would lead to accepting the null hypothesis of equal variances, but with a rather inconclusive p-value.
The t-test with equal variances was:
t.test(NFKBIZ_expression$CT_RATIO ~ NFKBIZ_expression$GENOTYPE, var.equal=TRUE) # Significant
##
## Two Sample t-test
##
## data: NFKBIZ_expression$CT_RATIO by NFKBIZ_expression$GENOTYPE
## t = -2.0715, df = 50, p-value = 0.04349
## alternative hypothesis: true difference in means between group ID+DD and group II is not equal to 0
## 95 percent confidence interval:
## -0.099653442 -0.001538263
## sample estimates:
## mean in group ID+DD mean in group II
## 1.267043 1.317639
And the t-test with no assumption of variance equality was:
t.test(NFKBIZ_expression$CT_RATIO ~ NFKBIZ_expression$GENOTYPE) # Defaults to Welch's approximation
##
## Welch Two Sample t-test
##
## data: NFKBIZ_expression$CT_RATIO by NFKBIZ_expression$GENOTYPE
## t = -1.9261, df = 32.316, p-value = 0.06292
## alternative hypothesis: true difference in means between group ID+DD and group II is not equal to 0
## 95 percent confidence interval:
## -0.104081416 0.002889711
## sample estimates:
## mean in group ID+DD mean in group II
## 1.267043 1.317639
Both results do not give an unanimous statement, being around the significance level of 0.05. We chose to declare non-significance in favor of being conservative, but a statistical trend is seen nevertheless. This trend refers to a higher mean of the Ct NFKBIZ/ACTB ratio in patients with the NFKBIZ rs3217713 II genotype, thus suggesting a lower expression of NFKBIZ in this group and indicating a possible functional role for the variant.
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plotly_4.10.0 ggsignif_0.6.3 epitools_0.5-10.1
## [4] HardyWeinberg_1.7.5 nnet_7.3-17 Rsolnp_1.16
## [7] mice_3.14.0 forcats_0.5.1 stringr_1.4.0
## [10] dplyr_1.0.9 purrr_0.3.4 readr_2.1.2
## [13] tidyr_1.2.0 tibble_3.1.7 ggplot2_3.3.6
## [16] tidyverse_1.3.1 readxl_1.4.0
##
## loaded via a namespace (and not attached):
## [1] httr_1.4.3 sass_0.4.1 jsonlite_1.8.0 viridisLite_0.4.0
## [5] modelr_0.1.8 bslib_0.3.1 assertthat_0.2.1 highr_0.9
## [9] cellranger_1.1.0 yaml_2.3.5 pillar_1.7.0 backports_1.4.1
## [13] lattice_0.20-45 glue_1.6.2 digest_0.6.29 rvest_1.0.2
## [17] colorspace_2.0-3 htmltools_0.5.2 pkgconfig_2.0.3 broom_0.8.0
## [21] haven_2.5.0 scales_1.2.0 tzdb_0.3.0 generics_0.1.2
## [25] farver_2.1.0 ellipsis_0.3.2 withr_2.5.0 lazyeval_0.2.2
## [29] cli_3.3.0 magrittr_2.0.3 crayon_1.5.1 evaluate_0.15
## [33] fs_1.5.2 fansi_1.0.3 MASS_7.3-58 xml2_1.3.3
## [37] truncnorm_1.0-8 tools_4.2.1 data.table_1.14.2 hms_1.1.1
## [41] lifecycle_1.0.1 munsell_0.5.0 reprex_2.0.1 compiler_4.2.1
## [45] jquerylib_0.1.4 rlang_1.0.2 grid_4.2.1 rstudioapi_0.13
## [49] htmlwidgets_1.5.4 crosstalk_1.2.0 labeling_0.4.2 rmarkdown_2.14
## [53] gtable_0.3.0 DBI_1.1.2 R6_2.5.1 lubridate_1.8.0
## [57] knitr_1.39 fastmap_1.1.0 utf8_1.2.2 stringi_1.7.6
## [61] parallel_4.2.1 Rcpp_1.0.8.3 vctrs_0.4.1 dbplyr_2.1.1
## [65] tidyselect_1.1.2 xfun_0.31